Fix cluster restart failure after bootstrap with built-in log service#266
This fix ensures that after initializing the log agent, we immediately refresh the leader information for all log groups. This is critical after cluster restart scenarios where log group leaders may have changed during the previous session (e.g., during bootstrap). The RefreshLeader call will query the current leader from each log group, updating the leader cache with accurate information before any log operations begin. This prevents connection failures that occur when trying to communicate with stale/incorrect leader nodes. Co-authored-by: githubzilla <10080336+githubzilla@users.noreply.github.com>
Added INFO and DEBUG level logging to track leader refresh operations for better observability and debugging. This helps verify that leader discovery is working correctly during cluster initialization and restart scenarios. Co-authored-by: githubzilla <10080336+githubzilla@users.noreply.github.com>
- Add warning when no log groups are found to detect configuration issues
- Add comment explaining RefreshLeader is lightweight and handles errors
- Improve code robustness and observability
Co-authored-by: githubzilla <10080336+githubzilla@users.noreply.github.com>
- Fix multi-line string formatting for better readability
- log_group_count is already stored in a variable (no redundant calls)
- Improved code consistency
Co-authored-by: githubzilla <10080336+githubzilla@users.noreply.github.com>
The _codeql_detected_source_root file is a temporary artifact from security scanning and should not be committed to the repository. Co-authored-by: githubzilla <10080336+githubzilla@users.noreply.github.com>
Here are some reminders before you submit the pull request
./mtr --suite=mono_main,mono_multi,mono_basic
Description
During bootstrap, the log service raft groups undergo multiple leader elections. On restart, tx_service initializes with a stale leader cache, so its connections to the log groups fail.
Changes
Added proactive leader discovery after log_agent initialization in Sharder::Init().
Key points:
- RefreshLeader() queries the current leader from each log group immediately after initialization
Impact
Resolves connection failures during cluster restart after bootstrap. Part of coordinated fix with log_service #27, eloq_log_service #33, eloqdoc #345, eloqkv #329, eloqsql #177.
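The refresh step described above can be sketched in isolation. This is a minimal illustration, not the code from this PR: `LogAgentSketch`, `LeaderQuery`, `CachedLeader`, and `RefreshAllLeaders` are hypothetical names, the leader lookup is injected as a callback instead of a real raft query, and log group ids are assumed to run from 0 to the group count minus one.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <iostream>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for the log agent; the real tx_service class differs.
class LogAgentSketch {
public:
    // Models asking a log group's raft peers which node currently leads.
    using LeaderQuery = std::function<uint32_t(uint32_t log_group_id)>;

    LogAgentSketch(std::unordered_map<uint32_t, uint32_t> cached_leaders,
                   LeaderQuery query_leader)
        : leader_cache_(std::move(cached_leaders)),
          query_leader_(std::move(query_leader)) {
        assert(static_cast<bool>(query_leader_));  // a lookup callback is required
    }

    uint32_t LogGroupCount() const {
        return static_cast<uint32_t>(leader_cache_.size());
    }

    // Overwrites the cached leader of one group with the freshly queried one.
    void RefreshLeader(uint32_t log_group_id) {
        leader_cache_[log_group_id] = query_leader_(log_group_id);
    }

    uint32_t CachedLeader(uint32_t log_group_id) const {
        return leader_cache_.at(log_group_id);
    }

private:
    std::unordered_map<uint32_t, uint32_t> leader_cache_;  // group id -> node id
    LeaderQuery query_leader_;
};

// Refreshes every group's leader right after the agent is initialized.
// Returns the number of groups refreshed; warns when there are none.
uint32_t RefreshAllLeaders(LogAgentSketch &agent) {
    const uint32_t log_group_count = agent.LogGroupCount();  // read once
    if (log_group_count == 0) {
        std::cerr << "WARNING: no log groups found; check the configuration\n";
        return 0;
    }
    // Assumption: group ids are contiguous, 0 .. log_group_count - 1.
    for (uint32_t lg_id = 0; lg_id < log_group_count; ++lg_id) {
        agent.RefreshLeader(lg_id);
    }
    return log_group_count;
}
```

With a cache that still names node 0 and a lookup that reports node 2, one call to `RefreshAllLeaders` brings the cache in line before any log operation runs.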
Original prompt
This section details the original issue you should resolve
<issue_title>[Bug]: Fix cluster start failure after bootstrap with built-in log service</issue_title>
<issue_description>- Software version (the version of the software you're running):(required)
e.g. 0.0.1
Reproduce steps (describe steps of reproducing the bug):(required)
Expected result (describe the right result as expected):(required)
Actual result (describe the incorrect result actually happened):(required)
Related issues (references to the related issues, e.g. eloqdb/repository#number):(optional)
</issue_description>
Comments on the Issue (you are @copilot in this section)
@githubzilla
# Log Service Raft Group Leader Change Analysis
Cluster Configuration
Timeline of Leader Changes
Phase 1: Initial Startup (Single Node)
Time: 15:11:18.727 - Node 0 becomes leader (Term 3)
Phase 2: Cluster Expansion - Node 1 Joins
Time: 15:11:30.818 - Node 1 becomes leader (Term 4)
Observation: Node 0 had log entries up to index 11 from term 2, but Node 1's term 4 entries start from index 1, causing log mismatch issues when Node 1 tries to replicate.
Phase 3: Leadership Transfer Request
Time: 15:11:35.837 - Leadership Transfer Initiated
Phase 4: Node 0 Wins Election
Time: 15:11:37.827 - Node 0 becomes leader (Term 5)
Issue: Node 0 still has log entries from term 2 (index 13), but Node 2 rejects AppendEntries because of term mismatch (Node 2 expects term 6 entries but Node 0 has term 2 entries).
Phase 5: Node 2 Wins Election
Time: 15:11:39.785 - Node 2 becomes leader (Term 6)
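To make the effect of this timeline on a cached leader concrete, the elections listed above can be replayed in a small sketch. The times, node ids, and terms are copied from the phases above; the `ElectionEvent` type and the helper functions are hypothetical, and the sketch only covers the events shown in this comment.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// One "node becomes leader" entry from the timeline (hypothetical type).
struct ElectionEvent {
    std::string time;
    uint32_t node_id;
    uint64_t term;
};

// The four elections listed in Phases 1, 2, 4 and 5.
std::vector<ElectionEvent> BootstrapTimeline() {
    return {
        {"15:11:18.727", 0, 3},
        {"15:11:30.818", 1, 4},
        {"15:11:37.827", 0, 5},
        {"15:11:39.785", 2, 6},
    };
}

// The node that won the election with the highest term.
uint32_t LeaderWithHighestTerm(const std::vector<ElectionEvent> &events) {
    assert(!events.empty());  // precondition: at least one election
    const ElectionEvent *latest = &events.front();
    for (const ElectionEvent &e : events) {
        if (e.term > latest->term) {
            latest = &e;
        }
    }
    return latest->node_id;
}

// How many times leadership moved to a different node.
int CountLeaderChanges(const std::vector<ElectionEvent> &events) {
    int changes = 0;
    for (std::size_t i = 1; i < events.size(); ++i) {
        if (events[i].node_id != events[i - 1].node_id) {
            ++changes;
        }
    }
    return changes;
}
```

For these events, leadership changes hands three times and ends on Node 2, so a cache that recorded Node 0 at the first election is stale by the time the cluster restarts.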